2016-12-28

Outline

  • What?
  • Why?
  • Where?
  • Automatic data pipeline
  • Data analysis
  • Visualization and Report
  • Cross functional collaboration

What is Unstructured Data?

Why? It is everywhere

  • Open government data
  • Search engine data
  • Social media data

Why? Practical arguments

  • Social media is impactful! [Amusing Ourselves to Death: Public Discourse in the Age of Show Business (1985)]
  • Financial resources are scarce
  • … and so is our time
  • Reproducibility

Where to get the data?

  • API: Twitter/Google/Wikipedia…
  • Webpage: Forum, Reviews
  • Survey
  • Interviews

API: Trump vs. Clinton Wikipedia Page Views
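
A minimal sketch of how such a comparison could be requested. It assumes the endpoint pattern of the Wikimedia Pageviews REST API; the helper `pageviews_url`, the article titles, and the date range are illustrative assumptions, not the ones used for this slide. Only the request URLs are built here; fetching and parsing the JSON response is left out.

```r
# Hypothetical helper: build a request URL for daily per-article page views.
# The endpoint pattern is assumed from the Wikimedia Pageviews REST API.
pageviews_url <- function(article, start, end, project = "en.wikipedia") {
  base <- "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
  paste(base, project, "all-access", "all-agents",
        utils::URLencode(article, reserved = TRUE),
        "daily", start, end, sep = "/")
}

# Illustrative article titles and date range (YYYYMMDD), not the slide's data
urls <- vapply(c("Donald_Trump", "Hillary_Clinton"), pageviews_url,
               character(1), start = "20160101", end = "20161108")
print(urls)
```

Each URL can then be passed to an HTTP client or a JSON reader of your choice.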

Static Web: Wikipedia

Static Web: buzzfeed.com

Summary of packages

Automatic Data Pipeline

Data Analytics: Regular Expression

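# Mixed vector of plain words and product codes
x <- c("here", "is", "P9929AMXT", "a", "P9703AM", "baby",
    "P0506AM", "example", "P1197AM", "P1271AM")
# Indices of the elements that start with "P" followed by one or more digits
idx <- grep("(^P)[[:digit:]]+", x)
# Keep only the matching elements
x[idx]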
## [1] "P9929AMXT" "P9703AM"   "P0506AM"   "P1197AM"   "P1271AM"
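
The same idea extends from selecting whole elements to pulling product codes out of free text. The sketch below uses made-up tweet texts and an assumed code pattern ("P", digits, then optional capital letters or digits); neither comes from the original analysis.

```r
# Hypothetical tweet texts; the product codes themselves appear in the slides
tweets <- c("Great stand with P1197AM this season #yieldhero",
            "Comparing P0157AMX and P1197AM side by side",
            "No product mentioned here")

# Assumed pattern: "P", one or more digits, then any capital letters or digits
pattern <- "\\bP[[:digit:]]+[A-Z0-9]*\\b"

# gregexpr() locates every match; regmatches() pulls out the matched text
codes <- regmatches(tweets, gregexpr(pattern, tweets, perl = TRUE))

# Count mentions per product, most frequent first
mentions <- sort(table(unlist(codes)), decreasing = TRUE)
print(mentions)
```

A tweet without any code yields an empty match, so it simply contributes nothing to the counts.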

Data Analytics: Natural Language Processing

NLP: How does a computer understand language?

Issue driven

  • If you don't know where to go …

  • If you know where to go …

NLP: What are you interested in?

## [1] "English is a crazy language"
## [1] "English muffins"

Marketing Campaign: #yieldhero

  • When is the best time to tweet?
  • Who to target?
  • Products mentioned
  • Sentiment score

#yieldhero Summary Statistics

  • From 2016-07-28 to 2016-11-18
  • 6914 tweets in total, of which 1930 are original tweets
  • Products mentioned more than 20 times
    • P1197AM
    • P0157AMX
    • P22T73R
    • P28T08R

When is the best time to tweet?

  • Total Tweet Counts

When is the best time to tweet?

  • Tweet and retweet counts by time of day
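
Counts like these can be tabulated in base R. The sketch below runs on a made-up four-row sample and assumes that retweets can be recognized by the conventional "RT @" prefix; the column names are hypothetical.

```r
# Hypothetical sample; a real analysis would use the collected tweet table
tweets <- data.frame(
  created = as.POSIXct(c("2016-08-01 08:15:00", "2016-08-01 08:40:00",
                         "2016-08-02 13:05:00", "2016-08-03 08:55:00"),
                       tz = "UTC"),
  text = c("Harvest update #yieldhero",
           "RT @someone: Harvest update #yieldhero",
           "Field day photos #yieldhero",
           "RT @someone: Field day photos #yieldhero"),
  stringsAsFactors = FALSE
)

# Assumption: retweets are identified by the conventional "RT @" prefix
tweets$type <- ifelse(grepl("^RT @", tweets$text), "retweet", "original")

# Hour of day (00-23) in which each tweet was created
tweets$hour <- format(tweets$created, "%H", tz = "UTC")

# Cross-tabulate tweet type against hour of day
by_hour <- table(tweets$type, tweets$hour)
print(by_hour)
```

Hours with many retweets relative to original tweets are candidates for the best time to post.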

Who to target?

Network
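
The slide does not say how the network was built. One common approach, sketched here under that assumption, is to treat every @mention or retweet as an edge from the author to the mentioned account. The sample tweets, user names, and handle pattern are hypothetical.

```r
# Hypothetical tweets: author plus text
tweets <- data.frame(
  user = c("alice", "bob", "carol"),
  text = c("RT @bob: New trial results #yieldhero",
           "Thanks @alice and @carol for sharing",
           "Planting today #yieldhero"),
  stringsAsFactors = FALSE
)

# Extract every @handle per tweet (assumed pattern: letters, digits, underscore)
handles <- regmatches(tweets$text, gregexpr("@[A-Za-z0-9_]+", tweets$text))

# One row per (author, mentioned account) pair: the edges of the network
edges <- data.frame(
  from = rep(tweets$user, lengths(handles)),
  to   = sub("^@", "", unlist(handles)),
  stringsAsFactors = FALSE
)

# In-degree: how often each account is mentioned or retweeted
in_degree <- sort(table(edges$to), decreasing = TRUE)
print(edges)
print(in_degree)
```

Accounts with a high in-degree are the ones other users engage with most, which makes them natural targets.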

Who to target?

Products Mentioned

Table of Products

Shiny App Example

library(shiny)
runApp('Rcode/Shiny_NLP')

Is web scraping legal?

Recommendations for your work

  • Encrypt sensitive personally identifiable information
  • YOU take full responsibility for your web scraping work
  • If you publish data, do not infringe copyright
  • If in doubt, ask the author/creator/provider of the data for permission
  • Consult the laws of your current jurisdiction

Trick: robots.txt

  • What is robots.txt?

    "Robots Exclusion Protocol", an informal protocol that asks web robots not to crawl certain content

  • Located in the root directory of a website, e.g. http://baidu.com/robots.txt
  • Documents which bots are allowed to crawl which resources (and which are not)
  • Not a technical barrier, but a sign that asks for compliance
  • Syntax in robots.txt
  • Scraping etiquette
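
A minimal sketch of checking a path against robots.txt before scraping. It reads only the `Disallow` lines of the `User-agent: *` group and ignores `Allow` rules, wildcards, and groups that list several user agents, so it is a simplification rather than a full implementation of the protocol. The file content and both helper functions are hypothetical.

```r
# Hypothetical robots.txt content; a real file would be read with readLines()
robots <- c("User-agent: *",
            "Disallow: /private/",
            "Disallow: /tmp/",
            "",
            "User-agent: BadBot",
            "Disallow: /")

# Simplified reading of the rules: collect Disallow paths in the "*" group only
disallowed_for_all <- function(lines) {
  lines <- trimws(sub("#.*$", "", lines))   # drop comments and padding
  agent <- NA_character_
  paths <- character(0)
  for (ln in lines) {
    if (grepl("^user-agent:", ln, ignore.case = TRUE)) {
      agent <- trimws(sub("^[^:]*:", "", ln))
    } else if (grepl("^disallow:", ln, ignore.case = TRUE) &&
               identical(agent, "*")) {
      paths <- c(paths, trimws(sub("^[^:]*:", "", ln)))
    }
  }
  paths[nzchar(paths)]                      # an empty Disallow allows everything
}

# TRUE if the path does not start with any disallowed prefix
is_allowed <- function(path, disallowed) {
  !any(startsWith(path, disallowed))
}

rules <- disallowed_for_all(robots)
print(rules)
print(is_allowed("/reviews/page1.html", rules))
print(is_allowed("/private/data.csv", rules))
```

Running such a check before each request is one simple way to follow the etiquette above.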

Data and Code